Failure Detection and Partial Redundancy in HPC
Author
Abstract
KHARBAS, KISHOR H. Failure Detection and Partial Redundancy in HPC. (Under the direction of Dr. Frank Mueller.)

To support the ever-increasing demand of scientific computations, today's High Performance Computing (HPC) systems have large numbers of computing elements running in parallel. Petascale computers, capable of a performance in excess of one PetaFLOPS (10^15 floating-point operations per second), are successfully deployed and used at a number of sites. Exascale computers, with one thousand times that scale and computing power, are projected to become available in less than 10 years. Reliability is one of the major challenges faced by exascale computing. With hundreds of thousands of cores, the mean time to failure is measured in minutes or hours instead of days or months, and failures are bound to happen during the execution of HPC applications. Current fault recovery techniques focus on reactive ways to mitigate faults. Central to any fault recovery method is the challenge of detecting faults and propagating this knowledge.

The first half of this thesis contributes fault detection capabilities at the MPI level. We propose two principal types of fault detection mechanisms: the first uses periodic liveness checks, while the second performs on-demand liveness checks. The two techniques are compared experimentally with respect to the overhead they impose on MPI applications.

Checkpoint and restart (CR) recovery is one fault recovery method used to reduce the amount of computation that can be lost. The execution state of the application is saved to stable storage so that, after a failure, computation can be restarted from a previous checkpoint rather than from the start of the application. Apart from storage overheads, CR-based fault recovery comes at an additional cost in application performance because normal execution is disrupted whenever checkpoints are taken. Studies have shown that applications running at large scale spend more than 50% of their total time saving checkpoints, restarting, and redoing lost work. Redundancy is another fault tolerance technique, which employs redundant processes performing the same task. If a process fails, a replica of it can take over its execution; thus, having redundant copies decreases the overall failure rate. The downside of this approach is that extra resources are used and there is additional overhead for redundant communication and synchronization. The objective of the second half of this work is to model and analyze the benefit of using checkpoint/restart in coordination with redundancy at different degrees, so as to minimize the total time and resources required for HPC applications to complete.
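As a concrete illustration of the first, periodic flavour of liveness checking, the sketch below shows a heartbeat-style probe over MPI in which each rank sends a heartbeat to its right-hand neighbour in a ring and suspects its left-hand neighbour if no heartbeat arrives within a timeout. The ring topology, the PROBE_TAG and PROBE_TIMEOUT constants, and the function names are illustrative assumptions, not the detector implemented in the thesis.

```c
/* Minimal heartbeat-style liveness check over MPI (illustrative sketch).
 * Assumptions: a ring of ranks, a tag reserved for probe traffic, and a
 * fixed timeout; a real detector would run this periodically in a
 * background thread or helper process. */
#include <mpi.h>
#include <stdio.h>

#define PROBE_TAG     42    /* tag reserved for liveness traffic (assumed) */
#define PROBE_TIMEOUT 1.0   /* seconds to wait before suspecting a rank    */

/* Send a heartbeat to the right neighbour and wait (bounded) for the
 * heartbeat expected from the left neighbour.  Returns 1 if the left
 * neighbour answered in time, 0 if it is suspected to have failed. */
static int left_neighbor_alive(MPI_Comm comm, int rank, int size)
{
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int out = rank, in = -1, done = 0;
    MPI_Request sreq, rreq;
    double deadline;

    MPI_Irecv(&in, 1, MPI_INT, left, PROBE_TAG, comm, &rreq);
    MPI_Isend(&out, 1, MPI_INT, right, PROBE_TAG, comm, &sreq);

    deadline = MPI_Wtime() + PROBE_TIMEOUT;
    while (!done && MPI_Wtime() < deadline)
        MPI_Test(&rreq, &done, MPI_STATUS_IGNORE);

    MPI_Wait(&sreq, MPI_STATUS_IGNORE);   /* small sends complete eagerly  */
    if (!done) {
        MPI_Cancel(&rreq);                /* no heartbeat: suspect 'left'  */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }
    return done;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (!left_neighbor_alive(MPI_COMM_WORLD, rank, size))
        fprintf(stderr, "rank %d: left neighbour suspected failed\n", rank);

    MPI_Finalize();
    return 0;
}
```

For the second half of the work, which models checkpoint/restart combined with redundancy, a standard first-order starting point (again an illustration rather than the thesis's exact model) is Young's approximation of the optimal checkpoint interval. Redundancy enters such a model by raising the effective mean time to failure M, since a replicated rank fails only once all of its replicas have failed, which in turn lengthens the optimal interval and reduces checkpoint overhead.

```latex
% Young's first-order approximation of the optimal checkpoint interval,
% with \delta the time to write one checkpoint and M the system MTTF:
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}
```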
Similar resources
The Case for Modular Redundancy in Large-scale High Performance Computing Systems
Recent investigations into resilience of large-scale high-performance computing (HPC) systems showed a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean-time to failure (MTTF) and a higher mean-time to recover (MTTR) than their predecessors. Modular redundancy is being used in many mission-critical systems today to provide for resilience, such...
Towards highly available and scalable high performance clusters
In recent years, we have witnessed a growing interest in high performance computing (HPC) using a cluster of workstations. This growth has made it affordable for individuals to have exclusive access to their own supercomputers. However, one of the challenges in a clustered environment is to keep system failures to a minimum and to achieve the highest possible level of system availability. High-Avai...
Cold standby redundancy optimization for nonrepairable series-parallel systems: Erlang time to failure distribution
In modeling a cold standby redundancy allocation problem (RAP) with an imperfect switching mechanism, deriving a closed-form expression of system reliability is too difficult. A convenient lower bound on system reliability is proposed, and this approximation is widely used as part of the objective function for system reliability maximization problems in the literature. Considering this lower bound do...
Towards Adaptive Resilience in High Performance Computing
With the current growth in computing capabilities of high performance computing (HPC) systems, Exascale HPC systems are expected to arrive by 2020 [1]. As systems become larger and more complex, they also become more error prone [2]. The failure rate of HPC systems rapidly increases, such that failures become the norm rather than the exception. Therefore, in such an unreliable environment, to mai...
Leveraging naturally distributed data redundancy to optimize collective replication
Dumping large amounts of related data simultaneously to local storage devices instead of a parallel file system is a frequent I/O pattern of HPC applications running at large scale. Since local storage resources are prone to failures and have limited potential to serve multiple requests in parallel, techniques such as replication are often used to enable resilience and high availability. Howeve...
Publication year: 2011